Building reputation systems for better ranking

How to rank web pages, scientists and online resources has recently attracted increasing attention
from both physicists and computer scientists. In this paper, we study the ranking problem of rating
systems where users rate objects with discrete ratings. We propose an algorithm that can
simultaneously evaluate the user reputation and object quality through iterative refinement. According to
both the artificially generated data and the real data from MovieLens and Amazon, our algorithm 
can considerably enhance the ranking accuracy. This work highlights the significance of reputation 
systems in the Internet era and points out a way to evaluate and compare the performances of 
different reputation systems. 

PACS numbers: 89.20.Hh, 89.65.Gh, 89.70.+c, 89.75.-k



I. INTRODUCTION 

Ranking may not be the best way to describe a system,
but it definitely provides valuable and impressive
information, especially for people who do not fully
understand the internal interactions and organization of
the system. Nowadays, ranking techniques are becoming
increasingly important in many online services, and we
are always curious about rankings of web pages, books,
scientists, movies, movie stars, and so on. For a simple
undirected graph, centralities are usually used to rank
the importance of nodes, while for a directed graph,
PageRank is the most widely applied algorithm, which
mimics a random walk process with restart. Considering a
possible underlying mixed role of each node, the HITS
algorithm may provide better ranking. Recently, some
scientists proposed a number of iterative refinement
algorithms to rank scientists and scientific publications
based on citation and co-authorship data.

In this paper, we consider the ranking problem in a
different kind of system, the rating system, where each
user votes on some objects with ratings (usually discrete
ratings from 1 to 5, as on Netflix.com and Amazon.com).
A straightforward method is to rank objects according to
their average ratings. However, a drawback is that some
users do not take their votes seriously at all, so the
evaluation obtained by simply averaging all ratings may
be less accurate. A promising way to overcome this
problem is to estimate the reputation or trust of each
user and to assign more weight to users with higher
reputation. In fact, building reputation systems or
reputation societies is a vital task in the Internet era,
with applications in personalized recommendations,
management of peer-to-peer systems, online sales in
e-commerce systems, design of mobile ad hoc networks
[15], and so on. However, estimating the reputation of a
user is not a trivial task. Yu et al. proposed an
iterative refinement algorithm, where the quality of an
object is quantified by its weighted average rating and a
user whose ratings are closer to the weighted average
ratings is considered to be of higher reputation. A user
having higher reputation is assigned more weight. At each
time step, every user's reputation and every object's
weighted average rating are recalculated, until the
system converges to steady distributions of reputations
and weighted average ratings. To achieve better
estimation of user reputation and object quality, the
basic iterative refinement model can be further extended
by accounting for the truncation of the rating and by
assuming a prior distribution on the parameters according
to a Bayesian model [18]. Similar problems based on
partial information and changing data [20] have also been
considered.

Most of the previous works used artificially generated 
data to evaluate the algorithmic performance. In this 
paper, beyond the artificial data, we use real data to test 
a modified iterative algorithm. The winners of the Best 
Picture of Oscar Awards among the movies in MovieLens 
data and the winners of the National Book Awards among 
the books in Amazon.com are treated as benchmark
objects. Experimental analysis shows that our modified
algorithm ranks the benchmark objects considerably higher
than the simple average rating does.



II. METHOD 

FIG. 1: $S$ and $\tau$ as functions of $\alpha$, where $Q$ and $\zeta$ obey the
uniform distribution. The rating density is fixed as $p = 0.05$.
All data points are obtained by averaging 100 independent
realizations.

A rating system consists of $N$ users and $M$ objects,
where each user rates some objects. We denote by $p$
($0 < p < 1$) the density of ratings (each user has voted
$pM$ objects on average), by $x_{ik}$ the rating of object $k$
by user $i$, and by $Q_k$ the intrinsic quality of object $k$,
which is usually not observable. If $Q_k$ is known, the mean
square deviation of user $i$'s votes from the objects'
intrinsic qualities is

$$\sigma_i = \frac{1}{M_i} \sum_{k} (x_{ik} - Q_k)^2, \qquad (1)$$

where $k$ runs over all the $M_i$ objects voted by user $i$.
We assume that a user with higher reputation has, on
average, a smaller $\sigma$, namely higher reputation
corresponds to better judgement of the intrinsic
qualities of objects. However, the intrinsic qualities
cannot be observed directly, and thus we can only
estimate them based on the users' ratings. Instead of
simply averaging over all ratings, in our reputation
system we assign a higher opinion weight to a user with
higher reputation. Denoting by $\xi_i$ the estimated mean
square deviation, and thus the reputation, of user $i$, we
assign a weight $\xi_i^{-\alpha}$ to user $i$, with $\alpha \ge 0$ a
free parameter, and thus the estimated quality of object
$k$, measured by the weighted average rating, is

$$q_k = \frac{\sum_{i} \xi_i^{-\alpha} x_{ik}}{\sum_{j} \xi_j^{-\alpha}}, \qquad (2)$$

where $N_k$ is the number of users having voted object $k$
and $i$, $j$ run over all these $N_k$ users. At the same time,
the mean square deviation of user $i$'s ratings can be
estimated as

$$\xi_i = \frac{1}{M_i} \sum_{k} (x_{ik} - q_k)^2, \qquad (3)$$

where $k$ runs over all the $M_i$ objects voted by user $i$.
When $\xi_i$ falls below a small positive cutoff, we set
$\xi_i$ equal to that cutoff to avoid divergence.
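A single pass through Eqs. (2) and (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary layout of the ratings, the toy data and the cutoff value `floor=1e-6` are all hypothetical choices, not taken from the paper.

```python
from collections import defaultdict


def weighted_quality(ratings, xi, alpha):
    """Eq. (2): weighted average rating q_k with user weights xi_i ** (-alpha)."""
    num = defaultdict(float)
    den = defaultdict(float)
    for (i, k), x in ratings.items():
        w = xi[i] ** (-alpha)
        num[k] += w * x
        den[k] += w
    return {k: num[k] / den[k] for k in num}


def user_deviation(ratings, q, floor=1e-6):
    """Eq. (3): mean square deviation of each user's ratings from q."""
    total = defaultdict(float)
    count = defaultdict(int)
    for (i, k), x in ratings.items():
        total[i] += (x - q[k]) ** 2
        count[i] += 1
    # `floor` is an assumed cutoff that keeps xi_i ** (-alpha) finite.
    return {i: max(total[i] / count[i], floor) for i in total}


# Toy data (hypothetical): users 0 and 1 agree, user 2 disagrees on object "a".
ratings = {(0, "a"): 4, (1, "a"): 4, (2, "a"): 1,
           (0, "b"): 2, (1, "b"): 2, (2, "b"): 2}
xi = {0: 1.0, 1: 1.0, 2: 1.0}                      # initial condition xi_i = 1
q = weighted_quality(ratings, xi, alpha=1.0)       # plain average at this point
xi = user_deviation(ratings, q)                    # user 2 gets the largest xi
q_next = weighted_quality(ratings, xi, alpha=1.0)  # user 2 now counts for less
```

With equal initial weights the first estimate is the plain average; after one update the dissenting user carries less weight, so the estimate for object "a" moves toward the majority rating.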

FIG. 2: $S$ and $\tau$ as functions of $\alpha$, where $Q$ obeys a
power-law distribution and $\zeta$ obeys the uniform
distribution. The rating density is fixed as $p = 0.05$. All data
points are obtained by averaging 100 independent
realizations.

Equations (2) and (3) describe an iterative refinement
method to estimate the user reputation and object
quality. We set the initial condition as $\xi_i = 1$ for all
$i$, and at each time step we first estimate $q$ by Eq. (2)
and then update $\xi$ by Eq. (3). The maximal differences
for $q$ and $\xi$ at the $n$th time step are defined as

$$\Delta_q(n) = \max_{k} |q_k(n) - q_k(n-1)|, \qquad (4)$$

$$\Delta_\xi(n) = \max_{i} |\xi_i(n) - \xi_i(n-1)|. \qquad (5)$$

The iterative process stops when both $\Delta_q$ and
$\Delta_\xi$ are smaller than a small threshold $\Delta_c$, and
the resulting $q$ and $\xi$ are used to rank the object
quality and user reputation, respectively.
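The complete loop, including the stopping rule of Eqs. (4) and (5), might look as follows. It is a minimal sketch rather than the authors' implementation: the function name `iterative_refinement`, the values of `tol`, `floor` and `max_steps`, and the toy ratings are assumptions made for illustration, and the convergence test is skipped on the first step because no previous $q$ exists yet.

```python
def iterative_refinement(ratings, alpha, tol=1e-4, floor=1e-6, max_steps=1000):
    """Iterate Eqs. (2) and (3) until both maximal changes fall below `tol`.

    `ratings` maps (user, object) -> rating. `tol`, `floor` and `max_steps`
    are illustrative values, not taken from the paper.
    """
    users = {i for i, _ in ratings}
    xi = {i: 1.0 for i in users}            # initial condition xi_i = 1
    q = None
    for step in range(1, max_steps + 1):
        # Eq. (2): weighted average rating of every object.
        num, den = {}, {}
        for (i, k), x in ratings.items():
            w = xi[i] ** (-alpha)
            num[k] = num.get(k, 0.0) + w * x
            den[k] = den.get(k, 0.0) + w
        new_q = {k: num[k] / den[k] for k in num}

        # Eq. (3): mean square deviation of every user's ratings,
        # bounded below by the assumed cutoff `floor`.
        total, count = {}, {}
        for (i, k), x in ratings.items():
            total[i] = total.get(i, 0.0) + (x - new_q[k]) ** 2
            count[i] = count.get(i, 0) + 1
        new_xi = {i: max(total[i] / count[i], floor) for i in total}

        # Eqs. (4) and (5): maximal change since the previous step.
        d_xi = max(abs(new_xi[i] - xi[i]) for i in xi)
        d_q = None if q is None else max(abs(new_q[k] - q[k]) for k in new_q)
        q, xi = new_q, new_xi
        if d_q is not None and d_q < tol and d_xi < tol:
            break
    return q, xi, step


# Toy data (hypothetical): users 0 and 1 agree, user 2 disagrees on object "a".
ratings = {(0, "a"): 4, (1, "a"): 4, (2, "a"): 1,
           (0, "b"): 2, (1, "b"): 2, (2, "b"): 2}
q, xi, steps = iterative_refinement(ratings, alpha=1.0)
```

On this toy input the two agreeing users are driven to the cutoff, which shows why a lower bound on $\xi_i$ is needed: without it their weights would diverge.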

III. RESULTS OF ARTIFICIAL DATA 

In this section, we test our algorithm on artificial
systems where the numbers of users and objects are fixed
as $N = 2000$ and $M = 1000$. We first generate the
intrinsic qualities of objects $Q$ and the noise levels of
users' judgements $\zeta$ according to some given
distributions (see later). Here the known (exact)
qualities and mean square deviations are denoted by $Q$
and $\sigma$ (later we will see that, at the statistical
level, $\sigma_i \sim \zeta_i^2$), while the estimated values are
$q$ and $\xi$. Then for each user-object ($i$-$k$) pair, with
probability $p$, we generate the artificial rating $x_{ik}$
as

$$x_{ik} = Q_k + \zeta_i \eta, \qquad (6)$$

where $\eta \in [-1, 1]$ is a random variable. The lower and
upper boundaries of the rating system are set as 0 and 5,
namely if $x_{ik}$ is smaller than 0 we reset it as 0 and if
it is larger than 5 we reset it as 5. According to Eq. (1),
at the statistical level, $\sigma_i \sim \zeta_i^2$.

FIG. 3: $S$ and $\tau$ as functions of $\alpha$, where $Q$ obeys a
power-law distribution and $\zeta$ obeys the uniform
distribution. The squares, circles and triangles represent the
results for $p = 0.01$, $p = 0.05$ and $p = 0.10$, respectively. All
data points are obtained by averaging 100 independent
realizations. The results with $Q$ obeying the uniform distribution
are qualitatively the same, and thus are omitted here.

TABLE I: Basic statistics of real data.

Data Set     M       N       S     p
Amazon       10000   16311   189   0.0002
MovieLens    3900    6040    74    0.0425

Initially we set $\xi_i = 1$ for all $i$ and then apply the
iterative algorithm described in Eqs. (2) and (3). After
we obtain the converged $q$ and $\xi$, we use the standard
deviation to quantify to what extent our method can
uncover the intrinsic qualities of objects:

$$S = \sqrt{\frac{1}{M} \sum_{k=1}^{M} (q_k - Q_k)^2}. \qquad (7)$$

Clearly, a smaller $S$ corresponds to better algorithmic
performance. Besides, we use a correlation measure
called Kendall's tau [21] to judge whether our algorithm
has successfully revealed the hidden rank of users'
reputations. For two lists, $Y$ and $Z$, with length $L$,
$\tau$ is given as

$$\tau = \frac{2}{L(L-1)} \sum_{i<j} \mathrm{sgn}[(Y_i - Y_j)(Z_i - Z_j)], \qquad (8)$$

where $\mathrm{sgn}(x) = 1$ for $x > 0$, $\mathrm{sgn}(x) = -1$ for
$x < 0$ and $\mathrm{sgn}(x) = 0$ for $x = 0$. The value of $\tau$
ranges from $+1$ (exactly the same order of the two lists
$Y$ and $Z$) to $-1$ (completely reversed order of the two
lists), and $\tau \approx 0$ for uncorrelated lists. Clearly, a
larger $\tau$ corresponds to better algorithmic
performance.

To our knowledge, there is no empirical analysis of the
quantitative distribution of people's judgements, so we
simply assume that $\zeta$ obeys a uniform distribution in
the range [0, 5]. For the qualities of objects, we test
two kinds of distributions: the uniform distribution and
a power-law distribution. We adopt the latter because for
many user-object bipartite systems the degrees of objects
are very heterogeneous [22], indicating that the
qualities of objects may also be heterogeneous. The value
of $Q$ is also restricted to the range [0, 5].
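The generation of artificial ratings by Eq. (6) can be sketched as below for the uniform case. Several points are assumptions made only for this illustration: the function name `generate_ratings` is hypothetical, $\eta$ is drawn uniformly from [-1, 1] (the text only says it is a random variable in that interval), the lower rating boundary is taken as 0 to match the range of $Q$, and a small system is used so the example runs quickly.

```python
import random


def generate_ratings(n_users, n_objects, density, seed=0):
    """Draw Q_k and zeta_i uniformly in [0, 5] and build ratings via Eq. (6)."""
    rng = random.Random(seed)
    Q = [rng.uniform(0.0, 5.0) for _ in range(n_objects)]    # intrinsic qualities
    zeta = [rng.uniform(0.0, 5.0) for _ in range(n_users)]   # user noise levels
    ratings = {}
    for i in range(n_users):
        for k in range(n_objects):
            if rng.random() < density:                       # pair rated with prob. p
                eta = rng.uniform(-1.0, 1.0)                 # assumed uniform noise
                x = Q[k] + zeta[i] * eta                     # Eq. (6)
                ratings[(i, k)] = min(5.0, max(0.0, x))      # clip to assumed [0, 5]
    return Q, zeta, ratings


# Small illustrative system; the text uses N = 2000, M = 1000.
Q, zeta, ratings = generate_ratings(n_users=200, n_objects=100, density=0.05, seed=1)
```

Ignoring the clipping at the boundaries, a uniform $\eta$ gives an expected squared deviation of $\zeta_i^2 / 3$ for user $i$, which is consistent with the scaling between $\sigma_i$ and $\zeta_i$ stated in the text.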

Figure 1 and Figure 2 respectively report the
algorithmic performance with different distributions of
object qualities. Although the shapes of the $S$-$\alpha$
curves and $\tau$-$\alpha$ curves in Fig. 1 and Fig. 2 differ in
some details, both figures clearly show the advantage of
our algorithm. Compared with the simple average (i.e.,
the case of $\alpha = 0$), our algorithm can provide
considerably better evaluations of user reputation and
object quality. We next study the effects of rating
density on algorithmic performance. As shown in Fig. 3,
the algorithm performs better for denser data but the
qualitative features do not change for different $p$.
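The two performance measures, Eqs. (7) and (8), follow directly from their definitions. The sketch below is a plain quadratic-time illustration with hypothetical function names (`quality_error`, `kendall_tau`); it is not tuned for large lists.

```python
import math


def quality_error(q, Q):
    """Eq. (7): root of the mean squared difference between q_k and Q_k."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, Q)) / len(Q))


def sgn(x):
    """Sign function: 1 for x > 0, -1 for x < 0, 0 for x == 0."""
    return (x > 0) - (x < 0)


def kendall_tau(Y, Z):
    """Eq. (8): sum of sgn over all pairs i < j, scaled by 2 / (L (L - 1))."""
    L = len(Y)
    total = sum(sgn((Y[i] - Y[j]) * (Z[i] - Z[j]))
                for i in range(L) for j in range(i + 1, L))
    return 2.0 * total / (L * (L - 1))


same = kendall_tau([1, 2, 3], [10, 20, 30])      # identical order
reverse = kendall_tau([1, 2, 3], [30, 20, 10])   # reversed order
mixed = kendall_tau([1, 2, 3], [1, 3, 2])        # one discordant pair out of three
err = quality_error([1.0, 2.0], [1.0, 4.0])
```

To compare estimated reputations with the true noise levels, one would pass the estimated $\xi$ and the known $\sigma$ (or $\zeta$) for the same users as the two lists.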



IV. EXPERIMENTAL RESULTS 

FIG. 4: AUC value as a function of $\alpha$ for Amazon (a) and
MovieLens (b). Results are obtained by averaging over 100
independent realizations since the objects with the same $q$
value may be assigned different orders in different realizations.

In this section, we test our algorithm on two real data
sets: MovieLens (http://www.grouplens.org/) and Amazon
(http://www.amazon.com/). The former consists of 6040
users and 3900 movies, and the latter consists of 16311
users and 10000 books (the Amazon data was collected from
July 2005 to September 2005). All the ratings on movies
and books are discrete integers from 1 to 5. Since in the
real world the users' reputations and objects' qualities
can never be exactly observed or quantified, we are not
able to test the algorithmic performance in a direct way.
Instead, we first select a subset of objects as benchmark
ones that are known to be of high quality, and then see
whether our algorithm assigns on average higher ranks to
these benchmark objects than the simple average of
ratings does. We apply the AUC statistic to evaluate our
algorithm, which is the probability that a randomly
selected benchmark object is assigned a higher rank than
a randomly selected non-benchmark one, as

$$\mathrm{AUC} = \frac{1}{S} \sum_{i} \frac{M - R_i}{M - S}, \qquad (9)$$

where $S$ denotes the number of benchmark objects, $i$ runs
over all benchmark objects and $1 \le R_i \le M$ is the rank
of object $i$. A completely random order of objects
corresponds to AUC = 0.5; therefore, the degree to which
AUC exceeds 0.5 indicates how much better the algorithm
performs than pure chance. The 74 movies winning the
Best Picture of Oscar Awards and the 189 books winning
the National Book Awards are selected as the benchmark
objects for MovieLens and Amazon, respectively. The
basic statistics of the real data sets are shown in
Table I.
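The AUC of Eq. (9) can be evaluated from a list of estimated qualities as sketched below. The function name `auc_from_scores` and the toy scores are hypothetical, and ties are broken at random (an assumed reading of the remark that objects with the same $q$ may be ordered differently in different realizations).

```python
import random


def auc_from_scores(scores, benchmark, rng=None):
    """Eq. (9): average of (M - R_i) / (M - S) over the benchmark objects.

    `scores` maps object -> estimated quality; rank 1 is the highest score.
    """
    rng = rng or random.Random(0)
    items = list(scores)
    rng.shuffle(items)                      # random order among equal scores
    items.sort(key=lambda k: scores[k], reverse=True)   # stable sort keeps it
    rank = {k: r for r, k in enumerate(items, start=1)}
    M, S = len(items), len(benchmark)
    return sum((M - rank[k]) / (M - S) for k in benchmark) / S


# Toy scores (hypothetical): object 0 has the highest score, object 9 the lowest.
scores = {k: 10 - k for k in range(10)}
best = auc_from_scores(scores, benchmark={0})
worst = auc_from_scores(scores, benchmark={9})
```

Averaging the returned value over many calls with different random generators mirrors the averaging over realizations. Because Eq. (9) counts every object ranked below a benchmark object, it matches the pairwise probability closely only when $S$ is much smaller than $M$, as is the case for both data sets in Table I.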

Figure 4 reports the experimental results. Although
the shapes of the AUC-$\alpha$ curves are different for
MovieLens and Amazon (they are also different from those
of the artificial systems), our algorithm outperforms the
simple average on both data sets. In accordance with the
results on artificial data, the sparser the ratings are,
the smaller the AUC is.



V. CONCLUSION AND DISCUSSION 

As stated by Masum and Zhang, how to quantify people's
reputation is an urgent challenge in the Internet era.
For example, spammers intentionally produce noisy and
malicious information that misleads our judgement, and
well-designed reputation systems can identify these
users or reduce their impact. In this paper, we focus on
bipartite rating systems and design an iterative
refinement method to evaluate the users' reputations and
objects' qualities. On both the artificially generated
data and the real data, our algorithm considerably
improves the evaluation accuracy. In addition, the method
adopted to test the algorithm on real data (a similar
method was reported very recently) suggests a good
platform for the quantitative competition of different
ranking algorithms. To our knowledge, although some
reputation-based ranking algorithms have been proposed
previously [10]-[18], [20], no empirical comparison
between them has been reported yet, and it is not easy to
say that one algorithm beats another without a reasonable
metric of algorithmic performance on real data. Thanks to
the increasing number of available data sets and the
metric suggested in this paper, extensive comparison
between various algorithms becomes feasible, from which
we hope the effectiveness and efficiency of related
algorithms can be largely improved in the near future.